Abstract

By means of the concept of factorial moments we examine DNA sequences 
from yeast to distinguish coding and non-coding regions. It is found that the 
factorial moments may be a powerful tool for analysis of DNA sequences. 
PACS numbers: 87.15.Cc, 87.10.+e, 87.14.Gg

I. Introduction 

DNA carries the genetic information of most living organisms, and the goal of genome
projects is to uncover that genetic information. Hence, genomes of many different species,
ranging from bacteria to complex vertebrates, are currently being sequenced. As automated 
sequencing techniques have started to produce a rapidly growing amount of raw DNA 
sequences, the extraction of information from these sequences becomes a scientific 
challenge. A large fraction of an organism's DNA is not used for encoding protein [1]. 
Hence, one basic task in the analysis of DNA sequences is the identification of coding 
regions. Since biochemical techniques alone are not sufficient for identifying all coding
regions in every genome, computational tools based on concepts used in many scientific
fields have recently played a prominent role. To cite a few examples, we could mention
gene identification [2-3], assignment of tentative functions to particular sequences [4-7],
and elucidation of their structure [7-18]. A relevant contribution is due to statistical
methods, namely Markovian approximations [19], correlation functions, Fourier
transforms [7,9,10], etc. However, these methods do not give specific information on how
different regions are characterized, and they also fail to distinguish one given species from
another. For instance, Markovian approximations describe a genome in terms of k-tuple
overlapping series of nucleotides (where k is the Markovian order) and might ignore some
correlations. Fourier transforms only detect periodicity and possible correlations, but the
information associated with these correlations lacks relevant details about the composition
of DNA chains. On the other hand, scientists in the field are trying combinations of
different methods for the recognition of coding and non-coding DNA regions (based on
techniques such as those mentioned above) in order to improve the prediction accuracy
of different packages, which currently reach approximately 90% accuracy [20-21]. What
is more, the many statistical patterns that differ between coding and non-coding
DNA have been found to be species dependent [22]. That is to say, traditional coding 
measures based upon these patterns need to be trained on organism-specific data sets before 
they can be applied to identify coding DNA. This training set dependence limits the 
applicability of traditional measures, as new genomes are currently being sequenced for 
which training sets do not exist. For these reasons, alternative tools able to give different 
ideas and estimators concerning the structure of DNA chains, especially the statistical 
patterns that are species-independent, represent an important contribution in the field. 
Detailed investigations of the statistical behavior of DNA, especially of the differences
between coding and non-coding segments, make it possible to find methods that
identify coding segments from DNA sequences theoretically [22-30]. Several novel
methods have been suggested in the literature, such as entropy segmentation, the NM method,
the mutual information function, etc. [22-24].

According to Li [31], a gene is a sequence of genomic DNA or RNA that performs a
specific function, a vaguer definition than the traditional one. Performing the
function may not require the gene to be translated or even transcribed. Three types of genes
are recognized at present, namely protein-coding genes, RNA-specifying genes and regulatory
genes. In this letter the coding segments refer to protein-coding genes.

In this letter, we suggest the concept of factorial moments as a coding measure. By
means of this concept we try to identify coding and non-coding regions of DNA
sequences from yeast. The method uses only the known general statistical properties
of coding and non-coding segments of DNA. In this way, prior training on
known data sets is avoided; furthermore, the search for additional biological information
(such as splice sites or termination signals) can also be avoided.

II. Factorial Moments (FM)

The past decade and more has witnessed remarkably intense experimental and theoretical activity
in the search for scale invariance and fractal structure in multihadron production processes,
called "intermittency" for short [32]. The primary motivation is the expectation that scale
invariance or self-similarity, analogous to that often encountered in complex non-linear systems,
might open new avenues ultimately leading to deeper insight into long-distance properties of QCD
and the unsolved problem of colour confinement.

Generally, intermittency can be described with the concept of the probability moment (PM).
Dividing a region of phase space $\Delta$ into $M$ bins, the volume of one bin is
$\delta = \Delta / M$. The $q$-order PM can then be written as [33],

$$C_q(M) = \sum_{m=1}^{M} p_m^{q},$$

where $p_m$ is the probability for a particle to occur in the $m$'th bin, which satisfies the
constraint $\sum_{m=1}^{M} p_m = 1$. For a self-similar structure, the PM obeys a power law,

$$C_q(M) \propto M^{-(q-1) D_q}, \qquad M \to \infty,$$

and $D_q$ is called the $q$-order fractal dimension or Rényi dimension. Simple discussions show
that $D_0$, $D_1$ and $\{D_q \mid q \ge 2\}$ reflect the geometry, information entropy and
particle correlation dimensions, respectively.
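
As a small illustration of the probability moment, the following Python sketch evaluates
$C_q(M)$ for a given set of bin probabilities and turns it into a finite-$M$ estimate of $D_q$.
It assumes the unnormalized form $C_q(M) = \sum_m p_m^q$ and the scaling
$C_q(M) \propto M^{-(q-1) D_q}$ with unit prefactor; the function names are hypothetical and
the case $q = 1$ is deliberately left out.

```python
import math


def probability_moment(p, q):
    """q-order probability moment C_q(M) = sum_m p_m**q over occupied bins."""
    return sum(pm ** q for pm in p if pm > 0)


def renyi_dimension_estimate(p, q):
    """Finite-M estimate of D_q, assuming C_q(M) = M**(-(q - 1) * D_q) exactly."""
    if q == 1:
        raise ValueError("q = 1 requires the entropy limit, not handled in this sketch")
    M = len(p)
    return -math.log(probability_moment(p, q)) / ((q - 1) * math.log(M))


uniform = [1.0 / 64] * 64        # probability spread evenly over 64 bins
clustered = [1.0] + [0.0] * 63   # all probability in a single bin
print(renyi_dimension_estimate(uniform, 2))    # close to 1
print(renyi_dimension_estimate(clustered, 2))  # close to 0
```

A homogeneous distribution gives an estimate near 1 and a fully concentrated one gives 0; a
real analysis would instead fit the slope of $\ln C_q$ against $\ln M$ over several values of $M$.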

It is well known that intermittency is related to strong dynamical fluctuations. However,
measurements of multihadron production yield the distribution of particle numbers directly
rather than the probability distribution, and the finite number of cases induces statistical
fluctuations. To describe the strong dynamical fluctuations while dismissing the statistical
fluctuations effectively, the factorial moment (FM) was suggested for investigating
intermittency [34,35]. The generally used form of the FM can be written as

$$F_q(M) = M^{q-1} \sum_{m=1}^{M}
  \frac{\langle n_m (n_m - 1) \cdots (n_m - q + 1) \rangle}
       {\langle n (n - 1) \cdots (n - q + 1) \rangle},$$

where $M$ is the number of bins into which the considered interval is divided, $n_m$ is the
number of particles occurring in the $m$'th bin, and $n$ is the total number of particles in all
the bins. A measure quantity can then be introduced to indicate the dynamical fluctuations,

$$\varphi_q = \lim_{M \to \infty} \frac{\ln F_q(M)}{\ln M}.$$
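
The factorial moment can be sketched in a few lines of Python. The block below assumes the form
$F_q(M) = M^{q-1} \sum_m \langle n_m(n_m-1)\cdots(n_m-q+1)\rangle / \langle n(n-1)\cdots(n-q+1)\rangle$,
with angle brackets taken as plain averages over cases; the names `falling_factorial` and
`factorial_moment` and the input layout (one list of bin counts per case) are illustrative
assumptions, not part of the original method description.

```python
def falling_factorial(n, q):
    """n (n - 1) ... (n - q + 1); zero whenever 0 <= n < q."""
    result = 1
    for k in range(q):
        result *= n - k
    return result


def factorial_moment(events, q):
    """F_q(M) for a list of cases, each given as a list of M bin counts n_m."""
    n_events = len(events)
    M = len(events[0])
    # Sum over bins of the case-averaged factorial moment of n_m.
    numerator = sum(
        sum(falling_factorial(case[m], q) for case in events) / n_events
        for m in range(M)
    )
    # Case-averaged factorial moment of the total multiplicity n.
    denominator = sum(falling_factorial(sum(case), q) for case in events) / n_events
    return M ** (q - 1) * numerator / denominator


events = [[2, 0], [1, 1]]   # two cases, two bins, two particles each
print(factorial_moment(events, 2))  # 1.0
```

The denominator vanishes when every case holds fewer than $q$ particles, so the sketch is only
meaningful for $q$ not larger than the typical multiplicity.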

Here we present a simple argument for the ability of the FM to dismiss statistical
fluctuations [33].

The statistical fluctuations obey Bernoulli and Poisson distributions for a system containing a
certain and an uncertain number of total particles, respectively. For a system containing a
certain total number of particles $n$, the distribution of particles in the bins can be
expressed as

$$Q(n_1, n_2, \ldots, n_M \mid p_1, p_2, \ldots, p_M)
  = \frac{n!}{n_1!\, n_2! \cdots n_M!}\, p_1^{n_1} p_2^{n_2} \cdots p_M^{n_M},$$

where $(p_1, p_2, \ldots, p_M)$ are the probabilities for a particle to occur in bins
$1, 2, \ldots, M$, respectively. Hence,

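$$\begin{aligned}
\langle n_m (n_m - 1) \cdots (n_m - q + 1) \rangle
 &= \int dp_1\, dp_2 \cdots dp_M\, P(p_1, p_2, \ldots, p_M)
    \sum_{n_1, \ldots, n_M} Q(n_1, n_2, \ldots, n_M \mid p_1, p_2, \ldots, p_M) \\
 &\qquad \times\, n_m (n_m - 1) \cdots (n_m - q + 1) \\
 &= n (n - 1) \cdots (n - q + 1) \int dp_1\, dp_2 \cdots dp_M\, P(p_1, p_2, \ldots, p_M)\, p_m^{q} \\
 &= n (n - 1) \cdots (n - q + 1)\, \langle p_m^{q} \rangle .
\end{aligned}$$

That is to say,

$$F_q(M) = M^{q-1} \sum_{m=1}^{M} \langle p_m^{q} \rangle = M^{q-1} \langle C_q(M) \rangle
  \propto M^{\varphi_q}, \qquad M \to \infty,$$

where $\langle C_q(M) \rangle$ is the probability moment averaged over $P(p_1, p_2, \ldots, p_M)$
and $\varphi_q$ is the exponent that measures the dynamical fluctuations.

Therefore the FM can describe the strong dynamical fluctuations and can dismiss the statistical
fluctuations effectively.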
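
The key step of this argument, that multinomial (Bernoulli) noise turns the factorial moment of
$n_m$ into $n(n-1)\cdots(n-q+1)\,p_m^q$, can be checked exactly for fixed probabilities. The
sketch below uses the fact that the marginal distribution of a single bin of a multinomial is
binomial, and works with exact rational arithmetic; the helper names and the particular values
of $n$, $q$ and $p$ are arbitrary choices for illustration.

```python
from fractions import Fraction
from math import comb


def falling_factorial(n, q):
    """n (n - 1) ... (n - q + 1); zero whenever 0 <= n < q."""
    result = 1
    for k in range(q):
        result *= n - k
    return result


def binomial_factorial_moment(n, p, q):
    """Exact <n_m (n_m - 1) ... (n_m - q + 1)> when n_m is binomial(n, p)."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k) * falling_factorial(k, q)
        for k in range(n + 1)
    )


n, q, p = 7, 3, Fraction(2, 5)
lhs = binomial_factorial_moment(n, p, q)
rhs = falling_factorial(n, q) * p ** q
print(lhs == rhs)  # True
```

Averaging both sides over a distribution of $p_m$ then gives the
$\langle p_m^q \rangle$ form used in the text.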

Besides the statistical fluctuations, there are some trivial dynamical processes that need to be
dismissed. These trivial dynamical processes make the average numbers of particles in different
phase-space bins unequal, in which case the FM should take its original form, which reads

$$F_q(M) = \frac{1}{M} \sum_{m=1}^{M}
  \frac{\langle n_m (n_m - 1) \cdots (n_m - q + 1) \rangle}{\langle n_m \rangle^{q}} .$$

A typical method to dismiss the fluctuations due to this kind of trivial dynamical process is to
transform the original distribution into a homogeneous distribution by means of an integration
method, as follows [36],

$$x(y) = \frac{\int_{y_a}^{y} \rho(y')\, dy'}{\int_{y_a}^{y_b} \rho(y')\, dy'},$$

where $\rho(y)$ is the particle density in the original variable $y$ and $[y_a, y_b]$ is the
considered interval. In this paper, however, we resolve this problem by constructing a series of
delay register vectors based upon the DNA sequences, as illustrated in Section III.

The concept of FM has been used to deal with many kinds of complex dynamical
processes in physics, such as multi-particle production at high energy and DNA melting and
denaturation with increasing temperature [37-38]. What is more, this concept has
also been extended to a new version called etermittency, to deal with problems where the
statistical average cannot be carried out properly [39].

Detailed works predict that in non-coding DNA sequences the elements A, T, C and G
are not positioned randomly, but exhibit a self-similar structure, while in coding DNA
sequences the elements are distributed in a quasi-random way. Therefore, it may be a
reasonable idea to distinguish coding and non-coding DNA sequences using the concept of
FM.

III. Application to DNA analysis

There are several statistical features that can be employed to distinguish non-coding and
coding regions, as illustrated below [40,41],

(a) The usage of strongly bonded nucleotide C-G pairs is usually less frequent than that
of weakly bonded A-T pairs;

(b) The C-G concentration may differ significantly between organisms, but is generally
larger in coding than in non-coding regions;

(c) The C-G concentration makes a strong "background" contribution to any possible
differences between non-coding and coding subsequences;

(d) Non-coding regions display long-range power-law correlations, and have features in
common with hierarchically structured languages, i.e. a linear Zipf plot and a non-zero
redundancy. That is to say, there are deterministic structures in non-coding regions.

For coding regions, by contrast, it seems that random rules dominate the sequences.
Therefore, the coding and non-coding regions behave completely differently; they are
sequences obeying different laws. To take into account these statistical characteristics of
DNA sequences, we construct a process as illustrated below [42-46],

(a) $d$ successive nucleotides along a DNA sequence are regarded as a case containing $d$
particles. The state of the case can be described with a $d$-dimensional vector
$(x_1, x_2, x_3, \ldots, x_d)$, where $x_i$ is the state value for the $i$'th nucleotide. We can
define the state values according to our counting rules. In this paper $x_i$ is set to 1
when the $i$'th position is occupied by C or G, and to 0 for A or T.

(b) For a segment of length $N$, the $N - d + 1$ possible successive cases form a
process. The process covers the entire DNA segment we are interested in, and can
be expressed as the series of $d$-dimensional delay-register vectors:

$$\begin{array}{c}
(x_1, x_2, x_3, \ldots, x_d) \\
(x_2, x_3, x_4, \ldots, x_{d+1}) \\
\vdots \\
(x_{N-d+1}, x_{N-d+2}, x_{N-d+3}, \ldots, x_N)
\end{array}$$
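
The construction in (a) and (b) can be sketched in Python as follows. The block builds the
$N - d + 1$ overlapping delay-register vectors with C/G mapped to 1 and every other symbol
mapped to 0, then tabulates how often each C/G count occurs per case. The function names are
hypothetical, and treating symbols other than A, C, G and T (for example N) as 0 is an
assumption of this sketch.

```python
from collections import Counter


def delay_register_vectors(sequence, d):
    """All N - d + 1 overlapping windows of length d as 0/1 state vectors."""
    if d < 1 or d > len(sequence):
        raise ValueError("case length d must satisfy 1 <= d <= len(sequence)")
    # State value 1 for C or G, 0 otherwise (A, T, and any other symbol).
    states = [1 if base in ("C", "G") else 0 for base in sequence.upper()]
    return [tuple(states[i:i + d]) for i in range(len(states) - d + 1)]


def cg_count_spectrum(sequence, d):
    """Distribution of the C/G count per case, normalized to unity."""
    counts = Counter(sum(vector) for vector in delay_register_vectors(sequence, d))
    total = sum(counts.values())
    return {m: counts.get(m, 0) / total for m in range(d + 1)}


print(delay_register_vectors("ACGT", 2))  # [(0, 1), (1, 1), (1, 0)]
print(cg_count_spectrum("ACGT", 2))
```

For the toy sequence `ACGT` with $d = 2$, the three cases contain 1, 2 and 1 C/G nucleotides, so
the spectrum puts weight 2/3 on a count of 1 and 1/3 on a count of 2.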

For each case we can count the number of occurrences of the nucleotides C and G.
The density spectrum $p_m$ (i.e., the distribution of $m$, normalized to unity) is then
obtained from the numbers of occurrences in all the cases. In this paper $F_q$ with $q = 4$ is
calculated; $F_q$ with other values of $q$, such as 5 or 6, can be obtained easily if
necessary. To indicate the differences between the processes constructed above, which reflect
the behaviors of different regions in a DNA sequence, we introduce a measure quantity as below,



$$\Delta F(t) = \frac{\sum_{m} \left( F_{0m} - F_{tm} \right)}{\sum_{m} 1},$$

where $m$ is the length of a case, and $F_{0m}$ and $F_{tm}$ are the FM of the initial process
(i.e. the region used for reference) and of the $t$'th process, respectively. If the $t$'th
region behaves similarly to the initial one, $\Delta F(t)$ will tend to zero, while
$\Delta F(t)$ will take a definite non-zero value when the successive processes step into a
region obeying different laws from the initial one. What is more, two regions with similar
behaviors will have almost the same values of $\Delta F(t)$.
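
A minimal sketch of this measure in Python is given below. It takes the FM values of the
reference process and of the $t$'th process, each keyed by case length $m$, and averages their
difference over the case lengths. The signed difference $F_{0m} - F_{tm}$ is used here as an
assumption; an absolute or squared difference could be substituted. The function name and the
numerical values are hypothetical and serve only as an illustration.

```python
def delta_f(reference_fm, current_fm):
    """Mean of F_0m - F_tm over the case lengths m present in both inputs."""
    lengths = sorted(set(reference_fm) & set(current_fm))
    if not lengths:
        raise ValueError("no common case lengths")
    return sum(reference_fm[m] - current_fm[m] for m in lengths) / len(lengths)


reference = {10: 1.20, 20: 1.35, 30: 1.50}   # hypothetical F_0m values
similar = dict(reference)                    # a region behaving like the reference
different = {10: 1.00, 20: 1.05, 30: 1.10}   # hypothetical F_tm values
print(delta_f(reference, similar))    # 0.0
print(delta_f(reference, different))
```

A region that reproduces the reference FM values gives exactly zero, while the second
hypothetical region gives a clearly non-zero value.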

In Fig. 1 we show the results for DNA sequences from yeast. The values of
$\Delta F(t)$, normalized to unity, are presented here. The initial part, 1-1200 bp, is chosen
as the reference segment.

The length of a case is set to 10, 20, 30, 40, 50 and 60, respectively. The length of a segment
used to construct a process is 1200 bp. We find that the right borders of almost all the
coding regions occur at the bottoms of valleys. Because the positions of valleys can be
obtained with considerable precision, the right borders can be determined with the FM
appropriately.
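
One simple way to locate such valleys in a computed $\Delta F(t)$ series is to look for strict
local minima. The Python sketch below does only that; it is an illustrative assumption about
how valley positions might be extracted, with a hypothetical function name, and it ignores
noise, plateaus and the end points of the series.

```python
def valley_positions(values):
    """Indices i where values[i] lies strictly below both neighbours."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    ]


print(valley_positions([0.9, 0.4, 0.7, 0.2, 0.8]))  # [1, 3]
```

In practice the series would probably need smoothing before such a test is applied, since
small fluctuations also produce local minima.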

In Fig. 2 the left borders are determined. First, the considered DNA sequence is
arranged in inverse order, i.e., the initial DNA sequence numbered
1, 2, 3, ..., N is renumbered N, N-1, ..., 1. Then the normalized $\Delta F(t)$ values are
calculated. The positions of the valleys fit the left borders very well.

Here we meet an essential problem, namely how to find a proper segment of the DNA
sequence to be employed as the reference. A bad reference may lead to fuzzy results.
Investigations of the differences among coding segments or among non-coding segments may be
helpful, and the FM method may be a powerful tool for this purpose. It is interesting to find
in the results above that the right borders and the left borders are almost all positioned
around two typical values, respectively. Perhaps we can catalogue the borders according to the
quantity $\Delta F(t)$ to a certain degree.